Exercise_LCS

Author

Laura Cunha Silva

House Prices in Portugal

This dataset entails data from 2008-2022 on the real Residential Property Prices for Portugal. Data was downloaded from: https://fred.stlouisfed.org/series/QPTR628BIS

Reference: Bank for International Settlements, Real Residential Property Prices for Portugal [QPTR628BIS], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/QPTR628BIS, June 15, 2023.

Import data

setwd("C:/Users/Laura/Desktop/BasicStats_PHS")
house_prices <- read.csv("house_prices.csv", sep = ",")
#house_prices <- as.data.frame(apply(house_prices, 2, gsub, pattern = "([0-9])\\.([0-4])", replacement= "\\1\\2"), stringsAsFactors = FALSE) #remove the thousand separator 
str(house_prices)
'data.frame':   60 obs. of  2 variables:
 $ DATE      : chr  "1/1/2008" "4/1/2008" "7/1/2008" "10/1/2008" ...
 $ QPTR628BIS: num  103.4 101.9 99.4 98.2 100 ...
house_pricesbck <- house_prices #backup 

Load packages

library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("ggplot2")
library("plotly")

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library("lubridate")
library("stats")

We have two variables:

  • DATE: character; date of recorded average price value of houses in Portugal

  • QPTR628BIS: character; recorded average price value of houses in Portugal. However, this variable has its values separated with the thousand separator. This is ok, we must however keep in mind that the values are expressed in thousands.

We must now make date be read as a date variable in order to accuratly evaluate how the prices have fluctuated over time.

house_prices$DATE <- dmy(house_prices$DATE)
house_prices <- house_prices %>% 
  rename(date = DATE) %>% 
  mutate(price = as.numeric(QPTR628BIS)) %>%
  select(-QPTR628BIS)

We want to focus our analysis solely on the years from pre and post COVID-19 pandemic

First case detected in Portugal in March 2020
house_prices <- house_prices %>% 
  mutate(year = year(date))

Data description

Basics

summary(house_prices$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  80.04   87.44   99.85  104.47  116.81  147.70 

Minimum value for a residential property in Portugal from 2008-2022 was around 80K euros. Mean value is 104K euros. Maximum value registered was of 147.7K euros.

Let’s evaluate the values dispersion through a boxplot

For better visualization purposes we will use plotly.

fig <- plot_ly(house_prices, x = ~price, color = ~as.factor(year), type = "box")
fig 
Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors

We can see that the prices have been increasing since 2017, with 2021 being the year where values most fluctuated throughout the year. It seems that there are differences between the pre COVID years (<2020 & >2016 ) and the post COVID years (>= 2020) in terms of prices. However, let’s prove this theory through statistical testing.

Let’s divide our data into pre and post covid dates.

prices_pre <- house_prices %>% 
  filter(year < 2020 & year > 2016)
unique(prices_pre$year)
[1] 2017 2018 2019
prices_pro <- house_prices %>% 
  filter(year >= 2020)
unique(prices_pro$year)
[1] 2020 2021 2022

Distribution of prices

Pre-COVID

# plot
p1 <- prices_pre %>%
  ggplot( aes(x=price)) +
    geom_histogram( binwidth=3, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle("Bin size = 3") +
    theme(
      plot.title = element_text(size=15)
    )
p1

Post-Covid

# plot
p1 <- prices_pro %>%
  ggplot( aes(x=price)) +
    geom_histogram( binwidth=3, fill="orange", color="#e9ecef", alpha=0.9) +
    ggtitle("Bin size = 3") +
    theme(
      plot.title = element_text(size=15)
    )
p1

From the histograms we can assume the data not to be normally distributed in neither the pre nor post-covid timeframes.

However, let’s do a QQ plot as well.

QQ plot

Generate a quantile-quantile (QQ) plot using geom_qq and geom_qq_line to assess its alignment with the theoretical line.

prices_pre %>%
  ggplot(aes(sample = price)) + 
  geom_qq_line(distribution = stats::qnorm) +
  geom_qq(color = "steelblue", distribution = stats::qnorm) + 
  xlab("Theoretical Quantiles") +
  ylab("Sample Quantiles") +
  theme_bw()

prices_pro %>%
  ggplot(aes(sample = price)) + 
  geom_qq_line(distribution = stats::qnorm) +
  geom_qq(color = "orange", distribution = stats::qnorm) + 
  xlab("Theoretical Quantiles") +
  ylab("Sample Quantiles") +
  theme_bw()

Our results are clearly compatible with the histograms above, which is expected.

Test normality

The null hypothesis of these tests is that “sample distribution is normal”. If the test is significant, the distribution is non-normal.

Since we have quite a small sample size for the post-covid timeframe we have to be aware that normality tests are sensitive to sample size. Small samples most often pass normality tests which is why we combined visual inspection and significance test in order to take the right decision.

shapiro.test(prices_pre$price)

    Shapiro-Wilk normality test

data:  prices_pre$price
W = 0.95219, p-value = 0.6692

From the output, the p-value > 0.05 implying that the distribution of the data are not significantly different from normal distribution. In other words, we can assume normality.

shapiro.test(prices_pro$price)

    Shapiro-Wilk normality test

data:  prices_pro$price
W = 0.87883, p-value = 0.08468

From the output, the p-value > 0.05 implying that the distribution of the data are not significantly different from normal distribution. In other words, we can assume normality.

However, this seems to be a case influenced by the small sample size which affects the sensitivity of the test. When we check with the plots above we can tell that the data is not normally distributed.

Compare price data before and after COVID

In order to choose the correct test to use we must take into consideration that: - Our data is not normally distributed - Our samples are not independent (they are related as they were measured over time)

When we have paired or related samples, we must consider using the Wilcoxon signed-rank test, which is appropriate when comparing two related groups with non-normally distributed data.

res <- wilcox.test(prices_pre$price, prices_pro$price, paired = TRUE)
res

    Wilcoxon signed rank exact test

data:  prices_pre$price and prices_pro$price
V = 0, p-value = 0.0004883
alternative hypothesis: true location shift is not equal to 0

The p-value of the test is 0.0004883, which is less than the significance level alpha = 0.05. We can conclude that the median price of the houses in Portugal before COVID is significantly different from the median prices after COVID with a p-value = 0.0004883.

Housing crisis in Portugal on the news

*END